feat(wasm): SmolLM2-135M fast default + Llama 1B quality option #37
Merged
Two changes for WASM demo reliability and speed:

1. Model: switch from Qwen3.5-0.8B (base, gated, Qwen arch issues) to Llama 3.2 1B Instruct (verified working, good quality, public HuggingFace URL, proper Instruct tuning for chat).

2. Speed: add -DTQ_NO_Q4=1 to WASM build. Skips the load-time Q4 reconversion (GGUF Q4_K_M → FP32 → internal Q4) which was expensive and redundant for already-quantized models. Uses GGUF on-the-fly dequant instead. Saves several seconds of model init and reduces peak memory usage.

Added compile-time #ifdef TQ_NO_Q4 guard in quant.h so it works in WASM (no getenv). Native builds are unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
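A minimal sketch of how a compile-time guard like the one described for quant.h could sit next to a runtime toggle. The function names `tq_flag_is_set` and `tq_skip_q4_reconversion` are hypothetical, and the use of an environment variable named `TQ_NO_Q4` on native builds is an assumption inferred from the "(no getenv)" remark, not the actual contents of quant.h.

```c
#include <stdlib.h>  /* getenv */

/* Hypothetical helper: any non-empty value other than "0" counts as "on". */
static int tq_flag_is_set(const char *value) {
    return value != NULL && value[0] != '\0' &&
           !(value[0] == '0' && value[1] == '\0');
}

/* Hypothetical query: should the loader skip the load-time Q4 reconversion? */
static int tq_skip_q4_reconversion(void) {
#ifdef TQ_NO_Q4
    /* Decided at compile time (e.g. -DTQ_NO_Q4=1), so no environment lookup
       is needed. This is the path a WASM build would take. */
    return 1;
#else
    /* Assumed native behavior: runtime opt-in through an environment
       variable, so builds without the define keep their default. */
    return tq_flag_is_set(getenv("TQ_NO_Q4"));
#endif
}
```

In this sketch a build with -DTQ_NO_Q4=1 always skips the reconversion, while a build without the define only skips it when the (assumed) environment variable is set.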
The 1B model causes a 15-30s+ prefill hang in WASM, which makes it unusable as the default.

SmolLM2-135M: 135MB download, <2s prefill, ~10-20 tok/s in WASM. Output quality is basic, but the model is responsive enough for a proper demo experience.

Llama 3.2 1B Instruct is kept as the "Quality" option for users willing to wait for the larger model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7c38ac7 to 8330cb5
1B model prefill takes 15-30s+ in WASM, which feels broken. SmolLM2-135M: 135MB, <2s prefill, responsive.
🤖 Generated with Claude Code